AllLife Bank wants to focus on its credit card customer base in the next financial year. The bank's marketing research team has advised that market penetration can be improved. Based on this input, the Marketing team proposes to run personalized campaigns to target new customers as well as upsell to existing ones. Another insight from the market research was that customers perceive the bank's support services poorly. Based on this, the Operations team wants to upgrade the service delivery model to ensure that customer queries are resolved faster. The Head of Marketing and the Head of Delivery have decided to reach out to the Data Science team for help.
To identify different segments in the existing customer base, based on their spending patterns as well as past interactions with the bank, using clustering algorithms, and to provide recommendations to the bank on how to better market to and service these customers.
The data provided covers various customers of the bank and their financial attributes, such as credit limit, the total number of credit cards a customer holds, and the different channels through which customers have contacted the bank with queries (visiting the bank, online, and through the call center).
Data Dictionary
# Libraries to help with reading and manipulating data
import numpy as np
import pandas as pd
# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 200)
# to scale the data using z-score
from sklearn.preprocessing import StandardScaler
# to compute distances
from scipy.spatial.distance import pdist, cdist
# to perform k-means clustering and compute silhouette scores
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
# to visualize the elbow curve and silhouette scores
from yellowbrick.cluster import KElbowVisualizer, SilhouetteVisualizer
# to perform hierarchical clustering, compute cophenetic correlation, and create dendrograms
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage, cophenet
# to perform PCA
from sklearn.decomposition import PCA
import warnings
warnings.filterwarnings('ignore')
#Format:
pd.options.display.float_format = '{:.4f}'.format
%matplotlib inline
# loading the dataset
data = pd.read_excel("Credit_Card_Customer_Data.xlsx")
# viewing a random sample of the dataset
data.sample(n=10, random_state=1)
| | Sl_No | Customer Key | Avg_Credit_Limit | Total_Credit_Cards | Total_visits_bank | Total_visits_online | Total_calls_made |
|---|---|---|---|---|---|---|---|
| 547 | 548 | 38125 | 26000 | 4 | 5 | 2 | 4 |
| 353 | 354 | 94437 | 9000 | 5 | 4 | 1 | 3 |
| 499 | 500 | 65825 | 68000 | 6 | 4 | 2 | 2 |
| 173 | 174 | 38410 | 9000 | 2 | 1 | 5 | 8 |
| 241 | 242 | 81878 | 10000 | 4 | 5 | 1 | 3 |
| 341 | 342 | 70779 | 18000 | 4 | 3 | 2 | 0 |
| 647 | 648 | 79953 | 183000 | 9 | 0 | 9 | 2 |
| 218 | 219 | 28208 | 19000 | 3 | 1 | 5 | 7 |
| 120 | 121 | 16577 | 10000 | 4 | 2 | 4 | 6 |
| 134 | 135 | 31256 | 13000 | 4 | 1 | 5 | 7 |
print('There are {row} records, and {col} columns in the dataset'.format(row=data.shape[0], col=data.shape[1]))
There are 660 records, and 7 columns in the dataset
# copying the data to another variable to avoid any changes to original data
df = data.copy()
# checking datatypes and number of non-null values for each column
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 660 entries, 0 to 659
Data columns (total 7 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   Sl_No                660 non-null    int64
 1   Customer Key         660 non-null    int64
 2   Avg_Credit_Limit     660 non-null    int64
 3   Total_Credit_Cards   660 non-null    int64
 4   Total_visits_bank    660 non-null    int64
 5   Total_visits_online  660 non-null    int64
 6   Total_calls_made     660 non-null    int64
dtypes: int64(7)
memory usage: 36.2 KB
Observations
# Let's look at the statistical summary of the data
df.describe(include="all").T
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Sl_No | 660.0000 | 330.5000 | 190.6699 | 1.0000 | 165.7500 | 330.5000 | 495.2500 | 660.0000 |
| Customer Key | 660.0000 | 55141.4439 | 25627.7722 | 11265.0000 | 33825.2500 | 53874.5000 | 77202.5000 | 99843.0000 |
| Avg_Credit_Limit | 660.0000 | 34574.2424 | 37625.4878 | 3000.0000 | 10000.0000 | 18000.0000 | 48000.0000 | 200000.0000 |
| Total_Credit_Cards | 660.0000 | 4.7061 | 2.1678 | 1.0000 | 3.0000 | 5.0000 | 6.0000 | 10.0000 |
| Total_visits_bank | 660.0000 | 2.4030 | 1.6318 | 0.0000 | 1.0000 | 2.0000 | 4.0000 | 5.0000 |
| Total_visits_online | 660.0000 | 2.6061 | 2.9357 | 0.0000 | 1.0000 | 2.0000 | 4.0000 | 15.0000 |
| Total_calls_made | 660.0000 | 3.5833 | 2.8653 | 0.0000 | 1.0000 | 3.0000 | 5.0000 | 10.0000 |
Observations
# dropping the serial no. column as it does not provide any information
df.drop(["Sl_No"], axis=1, inplace=True)
# checking for missing values
df.isna().sum()
Customer Key           0
Avg_Credit_Limit       0
Total_Credit_Cards     0
Total_visits_bank      0
Total_visits_online    0
Total_calls_made       0
dtype: int64
df.duplicated().sum()
0
# Checking the number of unique values in each variable
df.nunique()
Customer Key           655
Avg_Credit_Limit       110
Total_Credit_Cards      10
Total_visits_bank        6
Total_visits_online     16
Total_calls_made        11
dtype: int64
data[data['Customer Key'].isin(data[data['Customer Key'].duplicated()]['Customer Key'].tolist())].sort_values('Customer Key')
| | Sl_No | Customer Key | Avg_Credit_Limit | Total_Credit_Cards | Total_visits_bank | Total_visits_online | Total_calls_made |
|---|---|---|---|---|---|---|---|
| 48 | 49 | 37252 | 6000 | 4 | 0 | 2 | 8 |
| 432 | 433 | 37252 | 59000 | 6 | 2 | 1 | 2 |
| 4 | 5 | 47437 | 100000 | 6 | 0 | 12 | 3 |
| 332 | 333 | 47437 | 17000 | 7 | 3 | 1 | 0 |
| 411 | 412 | 50706 | 44000 | 4 | 5 | 0 | 2 |
| 541 | 542 | 50706 | 60000 | 7 | 5 | 2 | 2 |
| 391 | 392 | 96929 | 13000 | 4 | 5 | 0 | 0 |
| 398 | 399 | 96929 | 67000 | 6 | 2 | 2 | 2 |
| 104 | 105 | 97935 | 17000 | 2 | 1 | 2 | 10 |
| 632 | 633 | 97935 | 187000 | 7 | 1 | 7 | 0 |
# Subsetting the data for further analysis
subset_df = df.drop("Customer Key", axis=1)
subset_df.columns
Index(['Avg_Credit_Limit', 'Total_Credit_Cards', 'Total_visits_bank',
'Total_visits_online', 'Total_calls_made'],
dtype='object')
num_cols = subset_df.select_dtypes(include=np.number).columns.tolist()
num_cols
['Avg_Credit_Limit', 'Total_Credit_Cards', 'Total_visits_bank', 'Total_visits_online', 'Total_calls_made']
value_count_cols = ['Total_Credit_Cards',
'Total_visits_bank',
'Total_visits_online',
'Total_calls_made']
for column in value_count_cols:
print(df[column].value_counts())
print("-" * 50)
4     151
6     117
7     101
5      74
2      64
1      59
3      53
10     19
9      11
8      11
Name: Total_Credit_Cards, dtype: int64
--------------------------------------------------
2    158
1    112
0    100
3    100
5     98
4     92
Name: Total_visits_bank, dtype: int64
--------------------------------------------------
2     189
0     144
1     109
4      69
5      54
3      44
15     10
7       7
10      6
12      6
8       6
11      5
13      5
9       4
14      1
6       1
Name: Total_visits_online, dtype: int64
--------------------------------------------------
4     108
0      97
2      91
1      90
3      83
6      39
7      35
9      32
8      30
5      29
10     26
Name: Total_calls_made, dtype: int64
--------------------------------------------------
Observation
# function to plot a boxplot and a histogram along the same scale.
def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
"""
Boxplot and histogram combined
data: dataframe
feature: dataframe column
figsize: size of figure (default (12,7))
kde: whether to show the density curve (default False)
bins: number of bins for histogram (default None)
"""
f2, (ax_box2, ax_hist2) = plt.subplots(
nrows=2, # Number of rows of the subplot grid= 2
sharex=True, # x-axis will be shared among all subplots
gridspec_kw={"height_ratios": (0.25, 0.75)},
figsize=figsize,
) # creating the 2 subplots
sns.boxplot(
data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
) # boxplot will be created and a star will indicate the mean value of the column
sns.histplot(
data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins, palette="winter"
) if bins else sns.histplot(
data=data, x=feature, kde=kde, ax=ax_hist2
) # For histogram
ax_hist2.axvline(
data[feature].mean(), color="green", linestyle="--"
) # Add mean to the histogram
ax_hist2.axvline(
data[feature].median(), color="black", linestyle="-"
) # Add median to the histogram
# selecting numerical columns
for item in num_cols:
histogram_boxplot(subset_df, item, bins=50, kde=True, figsize=(10, 5))
# function to create labeled barplots
def labeled_barplot(data, feature, perc=False, n=None):
"""
Barplot with percentage at the top
data: dataframe
feature: dataframe column
perc: whether to display percentages instead of count (default is False)
n: displays the top n category levels (default is None, i.e., display all levels)
"""
total = len(data[feature]) # length of the column
count = data[feature].nunique()
if n is None:
plt.figure(figsize=(count + 2, 6))
else:
plt.figure(figsize=(n + 2, 6))
plt.xticks(rotation=90, fontsize=15)
ax = sns.countplot(
data=data,
x=feature,
palette="Paired",
order=data[feature].value_counts().index[:n].sort_values(),
)
for p in ax.patches:
if perc == True:
label = "{:.1f}%".format(
100 * p.get_height() / total
) # percentage of each class of the category
else:
label = p.get_height() # count of each level of the category
x = p.get_x() + p.get_width() / 2 # width of the plot
y = p.get_height() # height of the plot
ax.annotate(
label,
(x, y),
ha="center",
va="center",
size=12,
xytext=(0, 5),
textcoords="offset points",
) # annotate the percentage
plt.show() # show the plot
for i in range(len(value_count_cols)):
labeled_barplot(data, value_count_cols[i])
labeled_barplot(data, value_count_cols[i], perc=True)
all_col = subset_df.columns.tolist()
fig, axes = plt.subplots(1, 5, figsize=(20, 6))
fig.suptitle('Histogram of numerical variables', fontsize=20)
for ii in range(5):
    sns.countplot(ax=axes[ii], x=subset_df[all_col[ii]])
fig.tight_layout(pad=4.0)
## Number of customers who used all three channels
subset_df[(subset_df['Total_calls_made']!=0) & (subset_df['Total_visits_bank']!=0) & (subset_df['Total_visits_online']!=0)].count()
Avg_Credit_Limit       356
Total_Credit_Cards     356
Total_visits_bank      356
Total_visits_online    356
Total_calls_made       356
dtype: int64
## Number of customers who never used any of the three channels
subset_df[(subset_df['Total_calls_made']==0) & (subset_df['Total_visits_bank']==0) & (subset_df['Total_visits_online']==0)].count()
Avg_Credit_Limit       0
Total_Credit_Cards     0
Total_visits_bank      0
Total_visits_online    0
Total_calls_made       0
dtype: int64
from pandas_profiling import ProfileReport
profile = ProfileReport(subset_df)
profile.to_notebook_iframe()
Advantages of Univariate Analysis
Helps to check the distribution/spread of the data, primarily using histograms and box plots.
Numerical features can also be plotted using other options such as violin plots and cumulative density plots.
It is a good way to understand the data before starting the analysis or modeling.
sns.pairplot(subset_df, diag_kind='kde');
for col in value_count_cols:
    sns.lineplot(data=subset_df, x='Total_Credit_Cards', y=col)
Let's check for correlations.
plt.figure(figsize=(15, 7))
sns.heatmap(
subset_df[num_cols].corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral"
)
plt.show()
subset_df.isnull().sum()
Avg_Credit_Limit       0
Total_Credit_Cards     0
Total_visits_bank      0
Total_visits_online    0
Total_calls_made       0
dtype: int64
Let's find the percentage of outliers in each column of the data using the IQR method.
Treating outliers
We will cap/clip the minimum and maximum value of these columns to the lower and upper whisker value of the boxplot found using Q1 - 1.5*IQR and Q3 + 1.5*IQR, respectively.
Note: Generally, a multiplier of 1.5 * IQR is used to cap outlier values at the upper and lower whiskers, but any other multiplier (e.g., 0.5, 2, 3) can be chosen. The value depends on the business problem statement.
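As a minimal sketch of such capping on a small hypothetical series (not the bank data), pandas' `clip` applies the whisker bounds directly:

```python
import pandas as pd

# hypothetical series with one extreme value
s = pd.Series([10, 12, 11, 13, 100])

Q1, Q3 = s.quantile(0.25), s.quantile(0.75)
IQR = Q3 - Q1
lower, upper = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR

# cap values outside the whiskers to the whisker values
capped = s.clip(lower=lower, upper=upper)
```

Here the extreme value 100 is pulled down to the upper whisker while the in-range values are untouched.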
# Finding the 25th percentile and 75th percentile for the numerical columns.
Q1 = subset_df[num_cols].quantile(0.25)
Q3 = subset_df[num_cols].quantile(0.75)
IQR = Q3 - Q1  # Interquartile range (75th percentile - 25th percentile)
lower_whisker = (
    Q1 - 1.5 * IQR
)  # finding lower and upper bounds; all values outside these bounds are outliers
upper_whisker = Q3 + 1.5 * IQR
# Percentage of outliers in each column
((subset_df[num_cols] < lower_whisker) | (subset_df[num_cols] > upper_whisker)).sum() / subset_df.shape[
0
] * 100
Avg_Credit_Limit       5.9091
Total_Credit_Cards     0.0000
Total_visits_bank      0.0000
Total_visits_online    5.6061
Total_calls_made       0.0000
dtype: float64
# Plotting Boxplots of all our features to get idea of distribution and outliers
plt.figure(figsize=(18,6))
sns.boxplot(x="value", y="variable", data=pd.melt(subset_df))
plt.title('Boxplots of all variables', size=15)
plt.show()
Q1 = subset_df["Avg_Credit_Limit"].quantile(0.25) # 25th quantile
Q3 = subset_df["Avg_Credit_Limit"].quantile(0.75) # 75th quantile
IQR = Q3 - Q1  # Interquartile range (75th percentile - 25th percentile)
lower_whisker = Q1 - 1.5 * IQR
upper_whisker = Q3 + 1.5 * IQR
print(lower_whisker)
print(upper_whisker)
-47000.0 105000.0
subset_df.loc[subset_df["Avg_Credit_Limit"] > 105000.0].sort_values("Avg_Credit_Limit", ascending=False)
| | Avg_Credit_Limit | Total_Credit_Cards | Total_visits_bank | Total_visits_online | Total_calls_made |
|---|---|---|---|---|---|
| 630 | 200000 | 10 | 0 | 13 | 0 |
| 650 | 195000 | 10 | 1 | 12 | 2 |
| 631 | 195000 | 8 | 0 | 15 | 0 |
| 632 | 187000 | 7 | 1 | 7 | 0 |
| 646 | 186000 | 7 | 0 | 8 | 1 |
| 639 | 184000 | 7 | 1 | 15 | 2 |
| 647 | 183000 | 9 | 0 | 9 | 2 |
| 624 | 178000 | 7 | 0 | 11 | 0 |
| 622 | 176000 | 10 | 1 | 15 | 2 |
| 637 | 173000 | 9 | 1 | 11 | 0 |
| 658 | 172000 | 10 | 1 | 15 | 0 |
| 654 | 172000 | 10 | 1 | 9 | 1 |
| 645 | 171000 | 10 | 0 | 15 | 0 |
| 659 | 167000 | 9 | 0 | 12 | 2 |
| 621 | 166000 | 9 | 1 | 12 | 2 |
| 623 | 166000 | 10 | 0 | 7 | 0 |
| 614 | 163000 | 8 | 1 | 7 | 1 |
| 633 | 163000 | 7 | 1 | 10 | 1 |
| 619 | 158000 | 7 | 0 | 13 | 0 |
| 612 | 157000 | 9 | 1 | 14 | 1 |
| 652 | 156000 | 8 | 1 | 8 | 0 |
| 626 | 156000 | 9 | 1 | 10 | 2 |
| 629 | 155000 | 8 | 0 | 7 | 2 |
| 638 | 153000 | 8 | 1 | 7 | 1 |
| 627 | 146000 | 10 | 0 | 12 | 1 |
| 657 | 145000 | 8 | 1 | 9 | 1 |
| 641 | 144000 | 10 | 0 | 10 | 2 |
| 617 | 136000 | 8 | 0 | 13 | 0 |
| 651 | 132000 | 9 | 1 | 12 | 2 |
| 615 | 131000 | 9 | 1 | 10 | 1 |
| 644 | 127000 | 10 | 1 | 15 | 1 |
| 636 | 126000 | 10 | 1 | 8 | 0 |
| 640 | 123000 | 8 | 1 | 15 | 2 |
| 618 | 121000 | 7 | 0 | 13 | 2 |
| 635 | 114000 | 10 | 1 | 7 | 2 |
| 649 | 112000 | 10 | 1 | 8 | 1 |
| 648 | 111000 | 8 | 1 | 7 | 0 |
| 620 | 108000 | 10 | 0 | 15 | 1 |
| 634 | 106000 | 8 | 0 | 8 | 1 |
Q1 = subset_df["Total_visits_online"].quantile(0.25) # 25th quantile
Q3 = subset_df["Total_visits_online"].quantile(0.75) # 75th quantile
IQR = Q3 - Q1  # Interquartile range (75th percentile - 25th percentile)
lower_whisker = Q1 - 1.5 * IQR
upper_whisker = Q3 + 1.5 * IQR
print(lower_whisker)
print(upper_whisker)
-3.5 8.5
subset_df.loc[subset_df["Total_visits_online"] > 8.5].sort_values("Total_visits_online", ascending=False)
| | Avg_Credit_Limit | Total_Credit_Cards | Total_visits_bank | Total_visits_online | Total_calls_made |
|---|---|---|---|---|---|
| 631 | 195000 | 8 | 0 | 15 | 0 |
| 622 | 176000 | 10 | 1 | 15 | 2 |
| 658 | 172000 | 10 | 1 | 15 | 0 |
| 628 | 84000 | 9 | 1 | 15 | 0 |
| 653 | 95000 | 10 | 0 | 15 | 1 |
| 640 | 123000 | 8 | 1 | 15 | 2 |
| 644 | 127000 | 10 | 1 | 15 | 1 |
| 645 | 171000 | 10 | 0 | 15 | 0 |
| 639 | 184000 | 7 | 1 | 15 | 2 |
| 620 | 108000 | 10 | 0 | 15 | 1 |
| 612 | 157000 | 9 | 1 | 14 | 1 |
| 619 | 158000 | 7 | 0 | 13 | 0 |
| 618 | 121000 | 7 | 0 | 13 | 2 |
| 617 | 136000 | 8 | 0 | 13 | 0 |
| 656 | 84000 | 10 | 1 | 13 | 2 |
| 630 | 200000 | 10 | 0 | 13 | 0 |
| 650 | 195000 | 10 | 1 | 12 | 2 |
| 651 | 132000 | 9 | 1 | 12 | 2 |
| 659 | 167000 | 9 | 0 | 12 | 2 |
| 621 | 166000 | 9 | 1 | 12 | 2 |
| 4 | 100000 | 6 | 0 | 12 | 3 |
| 627 | 146000 | 10 | 0 | 12 | 1 |
| 616 | 96000 | 10 | 1 | 11 | 2 |
| 6 | 100000 | 5 | 0 | 11 | 2 |
| 613 | 94000 | 9 | 1 | 11 | 0 |
| 624 | 178000 | 7 | 0 | 11 | 0 |
| 637 | 173000 | 9 | 1 | 11 | 0 |
| 655 | 99000 | 10 | 1 | 10 | 0 |
| 615 | 131000 | 9 | 1 | 10 | 1 |
| 633 | 163000 | 7 | 1 | 10 | 1 |
| 626 | 156000 | 9 | 1 | 10 | 2 |
| 641 | 144000 | 10 | 0 | 10 | 2 |
| 1 | 50000 | 3 | 0 | 10 | 9 |
| 647 | 183000 | 9 | 0 | 9 | 2 |
| 654 | 172000 | 10 | 1 | 9 | 1 |
| 642 | 97000 | 10 | 1 | 9 | 2 |
| 657 | 145000 | 8 | 1 | 9 | 1 |
The outlying values in both columns look continuous and might be valid/useful information for business analysis; they could even form a separate cluster/segment of their own.
So, we are not going to treat the outliers in this case.
Feature scaling is a class of statistical techniques that, as the name implies, scales the features of the data so that they all have a similar range. An example makes this clearer:
If you have multiple independent variables such as age, salary, and height, with ranges of 18–100 years, 25,000–75,000 euros, and 1–2 meters respectively, feature scaling brings them all into the same range.
Feature scaling is especially relevant for machine learning models that compute some sort of distance metric, such as most clustering methods, including K-Means.
Scaling should therefore be done to avoid one feature dominating the others, because the unsupervised learning algorithm uses distance to measure similarity between data points.
StandardScaler standardizes a feature by subtracting the mean and then scaling to unit variance, i.e., dividing all the values by the standard deviation.
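As a quick sanity check on a tiny hypothetical array (not the bank data), standardizing by hand matches StandardScaler; note both use the population standard deviation (ddof=0), as does scipy's zscore:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

x = np.array([[1.0], [2.0], [3.0], [4.0]])

# manual z-score: subtract the mean, divide by the (population) standard deviation
z_manual = (x - x.mean()) / x.std()

# StandardScaler does the same computation
z_scaler = StandardScaler().fit_transform(x)

assert np.allclose(z_manual, z_scaler)
```

After scaling, the feature has mean 0 and standard deviation 1, so no single feature dominates the distance computation.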
from scipy.stats import zscore
subset_scaled_df=subset_df.apply(zscore)
subset_scaled_df.head(10)
| | Avg_Credit_Limit | Total_Credit_Cards | Total_visits_bank | Total_visits_online | Total_calls_made |
|---|---|---|---|---|---|
| 0 | 1.7402 | -1.2492 | -0.8605 | -0.5475 | -1.2515 |
| 1 | 0.4103 | -0.7876 | -1.4737 | 2.5205 | 1.8919 |
| 2 | 0.4103 | 1.0590 | -0.8605 | 0.1343 | 0.1455 |
| 3 | -0.1217 | 0.1357 | -0.8605 | -0.5475 | 0.1455 |
| 4 | 1.7402 | 0.5973 | -1.4737 | 3.2023 | -0.2037 |
| 5 | -0.3876 | -0.7876 | -1.4737 | -0.5475 | 1.5426 |
| 6 | 1.7402 | 0.1357 | -1.4737 | 2.8614 | -0.5530 |
| 7 | -0.5206 | -0.7876 | -1.4737 | -0.5475 | -0.9023 |
| 8 | -0.7866 | -1.2492 | -1.4737 | -0.2066 | -0.5530 |
| 9 | -0.8398 | -0.3259 | -1.4737 | -0.5475 | 1.1933 |
sc = StandardScaler()
subset_scaled_df1 = pd.DataFrame(sc.fit_transform(subset_df),columns=subset_df.columns)
subset_scaled_df1.head(10)
| | Avg_Credit_Limit | Total_Credit_Cards | Total_visits_bank | Total_visits_online | Total_calls_made |
|---|---|---|---|---|---|
| 0 | 1.7402 | -1.2492 | -0.8605 | -0.5475 | -1.2515 |
| 1 | 0.4103 | -0.7876 | -1.4737 | 2.5205 | 1.8919 |
| 2 | 0.4103 | 1.0590 | -0.8605 | 0.1343 | 0.1455 |
| 3 | -0.1217 | 0.1357 | -0.8605 | -0.5475 | 0.1455 |
| 4 | 1.7402 | 0.5973 | -1.4737 | 3.2023 | -0.2037 |
| 5 | -0.3876 | -0.7876 | -1.4737 | -0.5475 | 1.5426 |
| 6 | 1.7402 | 0.1357 | -1.4737 | 2.8614 | -0.5530 |
| 7 | -0.5206 | -0.7876 | -1.4737 | -0.5475 | -0.9023 |
| 8 | -0.7866 | -1.2492 | -1.4737 | -0.2066 | -0.5530 |
| 9 | -0.8398 | -0.3259 | -1.4737 | -0.5475 | 1.1933 |
clusters = range(1, 10)
meanDistortions = []  # empty list for average distortions
errors = []  # empty list for cluster errors (inertia)
for k in clusters:
    model = KMeans(n_clusters=k)  # initialize KMeans
    model.fit(subset_scaled_df)  # fit KMeans on the scaled data
    prediction = model.predict(subset_scaled_df)  # predict on the same scaled data
    distortion = (
        sum(np.min(cdist(subset_scaled_df, model.cluster_centers_, "euclidean"), axis=1))
        / subset_scaled_df.shape[0]
    )  # average distance of each point to its closest cluster center
    meanDistortions.append(distortion)  # store the distortion for this k
    errors.append(model.inertia_)  # store the within-cluster sum of squares
    print("Number of Clusters:", k, "\tAverage Distortion:", distortion)
print("\nCluster Error:", errors)
plt.plot(clusters, meanDistortions, "bo-")
plt.xlabel("k")  # x-axis label
plt.ylabel("Average Distortion")  # y-axis label
plt.title("Selecting k with the Elbow Method", fontsize=20)  # plot title
Number of Clusters: 1 	Average Distortion: 2.0069222262503614
Number of Clusters: 2 	Average Distortion: 1.4571553548514269
Number of Clusters: 3 	Average Distortion: 1.1466276549150365
Number of Clusters: 4 	Average Distortion: 1.0463825294774465
Number of Clusters: 5 	Average Distortion: 0.9908683849620168
Number of Clusters: 6 	Average Distortion: 0.9430843103448057
Number of Clusters: 7 	Average Distortion: 0.9094700983137036
Number of Clusters: 8 	Average Distortion: 0.8922456881128548
Number of Clusters: 9 	Average Distortion: 0.8633274546350456

Cluster Error: [3300.0000000000005, 2040.9898164784952, 933.0437490000534, 780.7736895551768, 704.4759188657506, 642.1490713311521, 600.049543334103, 567.2035125446861, 534.2648682537475]
Text(0.5, 1.0, 'Selecting k with the Elbow Method')
kev = KElbowVisualizer(
    estimator=KMeans(),
    k=10,
    metric="silhouette",
    # other metric options: "distortion" (default) and "calinski_harabasz"
).fit(subset_scaled_df)
kev.show();
Silhouette score is one of the methods for evaluating the quality of clusters created using clustering algorithms such as K-Means. The silhouette score is a measure of how similar an object is to its own cluster (cohesion) compared to other clusters (separation). Silhouette score has a range of [-1, 1].
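A minimal sketch on two hypothetical, well-separated blobs (not the bank data) shows the score approaching 1 when cohesion and separation are both good:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# two tight, well-separated blobs
X = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# high cohesion (tight blobs) + high separation (far apart) -> score near 1
score = silhouette_score(X, labels)
```

Conversely, overlapping or poorly separated clusters drag the score toward 0, and misassigned points push it negative.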
From the silhouette scores, it seems that 3 is a good value of k. Let's investigate different values of k more deeply using SilhouetteVisualizer.
sil_score = []  # empty list for silhouette scores
cluster_list = list(range(2, 10))  # candidate numbers of clusters, from 2 to 9
for n_clusters in cluster_list:
    clusterer = KMeans(n_clusters=n_clusters)  # initializing the KMeans algorithm
    preds = clusterer.fit_predict(subset_scaled_df)  # fitting and predicting on the scaled data
    score = silhouette_score(subset_scaled_df, preds)  # calculating the silhouette score
    sil_score.append(score)  # appending the silhouette score to the list
    print("For n_clusters = {}, silhouette score is {}".format(n_clusters, score))
plt.plot(cluster_list, sil_score)
For n_clusters = 2, silhouette score is 0.41842496663215445
For n_clusters = 3, silhouette score is 0.5157182558881063
For n_clusters = 4, silhouette score is 0.3556670619372605
For n_clusters = 5, silhouette score is 0.2726898791817692
For n_clusters = 6, silhouette score is 0.25583657571102003
For n_clusters = 7, silhouette score is 0.24847787091758058
For n_clusters = 8, silhouette score is 0.2266798472313902
For n_clusters = 9, silhouette score is 0.21852430691339222
[<matplotlib.lines.Line2D at 0x2448f5a1820>]
For a value of k to be a good choice, each cluster in the silhouette plot should extend beyond the average silhouette score, and the clusters should be of reasonably comparable thickness (size).
Note
for ii in range(9, 1, -1):
visualizer = SilhouetteVisualizer(KMeans(ii, random_state=1))
visualizer.fit(subset_scaled_df)
visualizer.show()
Observation
%%time
# Creating a KMeans model with 3 clusters and fitting our scaled data to the model
kmeans = KMeans(n_clusters=3, random_state=0)
kmeans.fit(subset_scaled_df)
CPU times: total: 344 ms
Wall time: 102 ms
KMeans(n_clusters=3, random_state=0)
kmeans.labels_
array([0, 1, 0, 0, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
# adding kmeans cluster labels to the original dataframe
df["K_means_segments"] = kmeans.labels_
# adding kmeans cluster labels to the subset dataframe
subset_df["K_means_segments"] = kmeans.labels_
K_Means_df = df
K_Means_df
| | Customer Key | Avg_Credit_Limit | Total_Credit_Cards | Total_visits_bank | Total_visits_online | Total_calls_made | K_means_segments |
|---|---|---|---|---|---|---|---|
| 0 | 87073 | 100000 | 2 | 1 | 1 | 0 | 0 |
| 1 | 38414 | 50000 | 3 | 0 | 10 | 9 | 1 |
| 2 | 17341 | 50000 | 7 | 1 | 3 | 4 | 0 |
| 3 | 40496 | 30000 | 5 | 1 | 1 | 4 | 0 |
| 4 | 47437 | 100000 | 6 | 0 | 12 | 3 | 2 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 655 | 51108 | 99000 | 10 | 1 | 10 | 0 | 2 |
| 656 | 60732 | 84000 | 10 | 1 | 13 | 2 | 2 |
| 657 | 53834 | 145000 | 8 | 1 | 9 | 1 | 2 |
| 658 | 80655 | 172000 | 10 | 1 | 15 | 0 | 2 |
| 659 | 80150 | 167000 | 9 | 0 | 12 | 2 | 2 |
660 rows × 7 columns
df[df["K_means_segments"] == 0]
| | Customer Key | Avg_Credit_Limit | Total_Credit_Cards | Total_visits_bank | Total_visits_online | Total_calls_made | K_means_segments |
|---|---|---|---|---|---|---|---|
| 0 | 87073 | 100000 | 2 | 1 | 1 | 0 | 0 |
| 2 | 17341 | 50000 | 7 | 1 | 3 | 4 | 0 |
| 3 | 40496 | 30000 | 5 | 1 | 1 | 4 | 0 |
| 228 | 54838 | 10000 | 7 | 2 | 0 | 0 | 0 |
| 229 | 35254 | 8000 | 7 | 2 | 1 | 4 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 607 | 90191 | 34000 | 6 | 3 | 1 | 0 | 0 |
| 608 | 49341 | 47000 | 4 | 4 | 1 | 0 | 0 |
| 609 | 11562 | 38000 | 4 | 3 | 2 | 0 | 0 |
| 610 | 16253 | 39000 | 7 | 2 | 1 | 0 | 0 |
| 611 | 80623 | 46000 | 7 | 4 | 1 | 3 | 0 |
386 rows × 7 columns
cluster_profile = df.groupby("K_means_segments").mean()
cluster_profile["count_in_each_segment"] = (
    df.groupby("K_means_segments")["Customer Key"].count().values
)
# let's display cluster profiles
cluster_profile.style.highlight_max(color="lightgreen", axis=0)
| K_means_segments | Customer Key | Avg_Credit_Limit | Total_Credit_Cards | Total_visits_bank | Total_visits_online | Total_calls_made | count_in_each_segment |
|---|---|---|---|---|---|---|---|
| 0 | 54881.329016 | 33782.383420 | 5.515544 | 3.489637 | 0.981865 | 2.000000 | 386 |
| 1 | 55239.830357 | 12174.107143 | 2.410714 | 0.933036 | 3.553571 | 6.870536 | 224 |
| 2 | 56708.760000 | 141040.000000 | 8.740000 | 0.600000 | 10.900000 | 1.080000 | 50 |
import plotly as py
import plotly.graph_objs as go
trace1 = go.Scatter3d(
x= df['Total_visits_bank'],
y= df['Total_visits_online'],
z= df['Total_calls_made'],
mode='markers',
marker=dict(
color = df['K_means_segments'],
size= 20,
line=dict(
color= df['K_means_segments'],
width= 12
),
opacity=0.8
),
)
d = [trace1]
layout = go.Layout(
title= 'K-Means Clusters',
scene = dict(
xaxis = dict(title = 'Bank Visits'),
yaxis = dict(title = 'Online Visits'),
zaxis = dict(title = 'Customer Service Calls')
)
)
fig = go.Figure(data=d, layout=layout)
py.offline.iplot(fig)
fig, axes = plt.subplots(1, 5, figsize=(16, 6))
fig.suptitle("Boxplot of numerical variables for each cluster")
counter = 0
for ii in range(5):
sns.boxplot(ax=axes[ii], y=df[num_cols[counter]], x=df["K_means_segments"])
counter = counter + 1
fig.tight_layout(pad=2.0)
Cluster 0:
Customers in this cluster prefer to handle bank transactions in person rather than over the phone or online.
Cluster 1:
Customers in this cluster prefer to call the bank to resolve issues rather than visiting a branch or using online banking.
Total_calls_made ranges between 1 and 10, and 75% of the customers in this cluster/segment call the bank or its customer service department at least 5 times.
Cluster 2:
Customers in this cluster prefer online/digital banking to resolve their issues rather than calling or visiting the bank.
covMatrix = np.cov(subset_scaled_df,rowvar=False)
print(covMatrix)
[[ 1.00151745  0.60978441 -0.10046453  0.55222122 -0.41498065]
 [ 0.60978441  1.00151745  0.31627526  0.16801286 -0.65223974]
 [-0.10046453  0.31627526  1.00151745 -0.55269882 -0.50678371]
 [ 0.55222122  0.16801286 -0.55269882  1.00151745  0.12749255]
 [-0.41498065 -0.65223974 -0.50678371  0.12749255  1.00151745]]
pca = PCA(n_components=5)
pca.fit(subset_scaled_df)
PCA(n_components=5)
# The eigenvalues
print(pca.explained_variance_)
[2.29029906 1.87416377 0.3205531 0.27826563 0.24430569]
# The eigenvectors
print(pca.components_)
[[ 0.48885901  0.59767943  0.28049227  0.11178284 -0.55912866]
 [ 0.40323959  0.03017103 -0.58658661  0.66516108  0.22352726]
 [ 0.00346147 -0.28498293 -0.6135219  -0.3049481  -0.67035086]
 [ 0.30861651 -0.74135216  0.44527763  0.31838769 -0.23560546]
 [-0.70933671  0.10512237  0.05058609  0.59220036 -0.36404738]]
# The percentage of variation explained by each eigenvector
print(pca.explained_variance_ratio_)
[0.45736578 0.37426483 0.06401348 0.0555688 0.04878711]
plt.bar(list(range(1, 6)), pca.explained_variance_ratio_, alpha=0.5, align='center')
plt.ylabel('Variation explained')
plt.xlabel('Principal component')
plt.show()
plt.step(list(range(1, 6)), np.cumsum(pca.explained_variance_ratio_), where='mid')
plt.ylabel('Cumulative variation explained')
plt.xlabel('Principal component')
plt.show()
Now 3 dimensions seem reasonable: with 3 components we can explain close to 90% of the variation in the original data.
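As a side note, scikit-learn can pick the number of components for a target variance threshold automatically when `n_components` is passed as a fraction. A sketch on hypothetical correlated data (not the bank dataset):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# 200 samples of 5 correlated features: 2 latent factors plus small noise
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + 0.05 * rng.normal(size=(200, 5))

# keep the smallest number of components explaining >= 90% of the variance
pca = PCA(n_components=0.90).fit(X)
```

Because almost all of the variance here comes from the two latent factors, at most two components are retained, and the retained components jointly meet the 90% threshold.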
pca3 = PCA(n_components=3)
pca3.fit(subset_scaled_df)
print(pca3.components_)
print(pca3.explained_variance_ratio_)
Xpca3 = pca3.transform(subset_scaled_df)
[[ 0.48885901  0.59767943  0.28049227  0.11178284 -0.55912866]
 [ 0.40323959  0.03017103 -0.58658661  0.66516108  0.22352726]
 [ 0.00346147 -0.28498293 -0.6135219  -0.3049481  -0.67035086]]
[0.45736578 0.37426483 0.06401348]
PCA_df = pd.DataFrame(Xpca3)
PCA_df.head()
| | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 0.5013 | 0.5248 | 1.8959 |
| 1 | -1.4596 | 3.1056 | -0.9068 |
| 2 | 0.5258 | 0.8240 | 0.0890 |
| 3 | -0.3623 | 0.1281 | 0.5582 |
| 4 | 1.2662 | 3.6687 | -0.1000 |
sns.pairplot(PCA_df)
<seaborn.axisgrid.PairGrid at 0x24490c134f0>
Let's find the cophenetic correlation for different distance metrics and linkage methods.
The cophenetic correlation coefficient is the correlation between the cophenetic distances (dendrogram distances) obtained from the tree and the original pairwise distances used to construct it. It measures how faithfully a dendrogram preserves the pairwise distances between the original data points.
The cophenetic distance between two observations is represented in a dendrogram by the height of the link at which those two observations are first joined; that height is the distance between the two subclusters merged by that link.
The cophenetic correlation is also a way to compare two or more dendrograms.
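As a self-contained illustration of the idea, the cophenetic correlation can be computed on a small synthetic dataset (the two-blob data here is hypothetical, not the bank data):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist

# Two well-separated synthetic blobs
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.3, size=(20, 2)),
    rng.normal(loc=5.0, scale=0.3, size=(20, 2)),
])

# Build the tree, then correlate the cophenetic (dendrogram) distances
# with the original pairwise distances
Z = linkage(X, metric="euclidean", method="average")
c, coph_dists = cophenet(Z, pdist(X))
```

For clearly separated clusters like these, `c` is close to 1, meaning the dendrogram preserves the pairwise distances almost perfectly.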
# list of distance metrics
distance_metrics = ["euclidean", "chebyshev", "mahalanobis", "cityblock"]
# list of linkage methods
linkage_methods = ["single", "complete", "average", "weighted"]
high_cophenet_corr = 0
high_dm_lm = [0, 0]
for dm in distance_metrics:
    for lm in linkage_methods:
        Z = linkage(subset_scaled_df, metric=dm, method=lm)
        c, coph_dists = cophenet(Z, pdist(subset_scaled_df))
        print(
            "Cophenetic correlation for {} distance and {} linkage is {}.".format(
                dm.capitalize(), lm, c
            )
        )
        if high_cophenet_corr < c:
            high_cophenet_corr = c
            high_dm_lm[0] = dm
            high_dm_lm[1] = lm
Cophenetic correlation for Euclidean distance and single linkage is 0.7391220243806552.
Cophenetic correlation for Euclidean distance and complete linkage is 0.8599730607972423.
Cophenetic correlation for Euclidean distance and average linkage is 0.8977080867389372.
Cophenetic correlation for Euclidean distance and weighted linkage is 0.8861746814895477.
Cophenetic correlation for Chebyshev distance and single linkage is 0.7382354769296767.
Cophenetic correlation for Chebyshev distance and complete linkage is 0.8533474836336782.
Cophenetic correlation for Chebyshev distance and average linkage is 0.8974159511838106.
Cophenetic correlation for Chebyshev distance and weighted linkage is 0.8913624010768603.
Cophenetic correlation for Mahalanobis distance and single linkage is 0.7058064784553606.
Cophenetic correlation for Mahalanobis distance and complete linkage is 0.5422791209801747.
Cophenetic correlation for Mahalanobis distance and average linkage is 0.8326994115042134.
Cophenetic correlation for Mahalanobis distance and weighted linkage is 0.7805990615142516.
Cophenetic correlation for Cityblock distance and single linkage is 0.7252379350252723.
Cophenetic correlation for Cityblock distance and complete linkage is 0.8731477899179829.
Cophenetic correlation for Cityblock distance and average linkage is 0.896329431104133.
Cophenetic correlation for Cityblock distance and weighted linkage is 0.8825520731498188.
# printing the combination of distance metric and linkage method with the highest cophenetic correlation
print(
"Highest cophenetic correlation is {}, which is obtained with {} distance and {} linkage.".format(
high_cophenet_corr, high_dm_lm[0].capitalize(), high_dm_lm[1]
)
)
Highest cophenetic correlation is 0.8977080867389372, which is obtained with Euclidean distance and average linkage.
Let's explore different linkage methods with Euclidean distance only.
# list of linkage methods
linkage_methods = ["single", "complete", "average", "centroid", "ward", "weighted"]
high_cophenet_corr = 0
high_dm_lm = [0, 0]
for lm in linkage_methods:
    Z = linkage(subset_scaled_df, metric="euclidean", method=lm)
    c, coph_dists = cophenet(Z, pdist(subset_scaled_df))
    print("Cophenetic correlation for {} linkage is {}.".format(lm, c))
    if high_cophenet_corr < c:
        high_cophenet_corr = c
        high_dm_lm[0] = "euclidean"
        high_dm_lm[1] = lm
Cophenetic correlation for single linkage is 0.7391220243806552.
Cophenetic correlation for complete linkage is 0.8599730607972423.
Cophenetic correlation for average linkage is 0.8977080867389372.
Cophenetic correlation for centroid linkage is 0.8939385846326323.
Cophenetic correlation for ward linkage is 0.7415156284827493.
Cophenetic correlation for weighted linkage is 0.8861746814895477.
# printing the combination of distance metric and linkage method with the highest cophenetic correlation
print(
"Highest cophenetic correlation is {}, which is obtained with {} linkage.".format(
high_cophenet_corr, high_dm_lm[1]
)
)
Highest cophenetic correlation is 0.8977080867389372, which is obtained with average linkage.
We see that the cophenetic correlation is maximum with Euclidean distance and average linkage.
Let's see the dendrograms for the different linkage methods.
# list of linkage methods
linkage_methods = ["single", "complete", "average", "centroid", "ward", "weighted"]
# lists to save results of cophenetic correlation calculation
compare_cols = ["Linkage", "Cophenetic Coefficient"]
# to create a subplot image
fig, axs = plt.subplots(len(linkage_methods), 1, figsize=(15, 30))
# We will enumerate through the list of linkage methods above
# For each linkage method, we will plot the dendrogram and calculate the cophenetic correlation
for i, method in enumerate(linkage_methods):
    Z = linkage(subset_scaled_df, metric="euclidean", method=method)
    dendrogram(Z, ax=axs[i])
    axs[i].set_title(f"Dendrogram ({method.capitalize()} Linkage)")
    coph_corr, coph_dist = cophenet(Z, pdist(subset_scaled_df))
    axs[i].annotate(
        f"Cophenetic\nCorrelation\n{coph_corr:0.2f}",
        (0.80, 0.80),
        xycoords="axes fraction",
    )
Observations
The optimal number of clusters can be read off a dendrogram by deciding where to cut the cluster tree. Generally, the tree is cut where the dendrogram height (merge distance) is largest, as that usually corresponds to distinct and homogeneous clusters. The dendrogram for average linkage had the highest cophenetic correlation, so we use average linkage and choose 3 clusters, where the dendrogram height is still quite large. (The maximum height corresponds to 2 clusters, but that would not be meaningful: it would leave one cluster with most of the data, making it hard to contact customers according to their preferences.)
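Once a cut height is chosen, `scipy.cluster.hierarchy.fcluster` turns the tree into flat cluster labels. A minimal sketch on hypothetical, well-separated blobs (the threshold `t=2.0` is an illustrative assumption, not the value used for the bank data):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Three tight synthetic blobs
rng = np.random.default_rng(1)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.2, size=(15, 2)),
    rng.normal(loc=4.0, scale=0.2, size=(15, 2)),
    rng.normal(loc=8.0, scale=0.2, size=(15, 2)),
])

Z = linkage(X, metric="euclidean", method="average")

# Cut the tree at a chosen height (the horizontal line on the dendrogram)...
labels_by_height = fcluster(Z, t=2.0, criterion="distance")
# ...or simply request a fixed number of flat clusters
labels_by_count = fcluster(Z, t=3, criterion="maxclust")
```

Both calls recover the three blobs here; the `"distance"` criterion mirrors drawing a cut line on the dendrogram, while `"maxclust"` mirrors choosing the cluster count directly.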
Let's visualize the dendrogram cut for average linkage in the plot below.
# list of linkage methods
linkage_methods = ["average"]
# lists to save results of cophenetic correlation calculation
compare_cols = ["Linkage", "Cophenetic Coefficient"]
# to create a subplot image
fig, axs = plt.subplots(len(linkage_methods), 1, figsize=(15, 15))
# We will enumerate through the list of linkage methods above
# For each linkage method, we will plot the dendrogram and calculate the cophenetic correlation
for i, method in enumerate(linkage_methods):
    Z = linkage(subset_scaled_df, metric="euclidean", method=method)
    dendrogram(Z, ax=axs)
    axs.set_title(f"Dendrogram ({method.capitalize()} Linkage)")
    coph_corr, coph_dist = cophenet(Z, pdist(subset_scaled_df))
    axs.annotate(
        f"Cophenetic\nCorrelation\n{coph_corr:0.2f}",
        (0.80, 0.80),
        xycoords="axes fraction",
    )
dendrogram(Z, color_threshold=3.2)
plt.axhline(y=3.2, c="red", lw=1, linestyle="dashdot")
HCmodel = AgglomerativeClustering(n_clusters=3, affinity="euclidean", linkage="average")
HCmodel.fit(subset_scaled_df)
AgglomerativeClustering(linkage='average', n_clusters=3)
subset_scaled_df["HC_Clusters"] = HCmodel.labels_
df["HC_Clusters"] = HCmodel.labels_
HC_Model_df = df
HC_Model_df
|   | Customer Key | Avg_Credit_Limit | Total_Credit_Cards | Total_visits_bank | Total_visits_online | Total_calls_made | K_means_segments | HC_Clusters |
|---|---|---|---|---|---|---|---|---|
| 0 | 87073 | 100000 | 2 | 1 | 1 | 0 | 0 | 0 |
| 1 | 38414 | 50000 | 3 | 0 | 10 | 9 | 1 | 2 |
| 2 | 17341 | 50000 | 7 | 1 | 3 | 4 | 0 | 0 |
| 3 | 40496 | 30000 | 5 | 1 | 1 | 4 | 0 | 0 |
| 4 | 47437 | 100000 | 6 | 0 | 12 | 3 | 2 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 655 | 51108 | 99000 | 10 | 1 | 10 | 0 | 2 | 1 |
| 656 | 60732 | 84000 | 10 | 1 | 13 | 2 | 2 | 1 |
| 657 | 53834 | 145000 | 8 | 1 | 9 | 1 | 2 | 1 |
| 658 | 80655 | 172000 | 10 | 1 | 15 | 0 | 2 | 1 |
| 659 | 80150 | 167000 | 9 | 0 | 12 | 2 | 2 | 1 |
660 rows × 8 columns
cluster_profile = df.groupby("HC_Clusters").mean()
cluster_profile["count_in_each_segments"] = (
df.groupby("HC_Clusters")["Customer Key"].count().values
)
df['HC_Clusters'].value_counts()
0    387
2    223
1     50
Name: HC_Clusters, dtype: int64
Average linkage gives good clusters here, but ward linkage tends to produce more homogeneous, balanced clusters, so let's try ward as well.
Let's visualize the dendrogram cut for ward linkage in the plot below.
# list of linkage methods
linkage_methods = ["ward"]
# lists to save results of cophenetic correlation calculation
compare_cols = ["Linkage", "Cophenetic Coefficient"]
# to create a subplot image
fig, axs = plt.subplots(len(linkage_methods), 1, figsize=(15, 15))
# We will enumerate through the list of linkage methods above
# For each linkage method, we will plot the dendrogram and calculate the cophenetic correlation
for i, method in enumerate(linkage_methods):
    Z = linkage(subset_scaled_df, metric="euclidean", method=method)
    dendrogram(Z, ax=axs)
    axs.set_title(f"Dendrogram ({method.capitalize()} Linkage)")
    coph_corr, coph_dist = cophenet(Z, pdist(subset_scaled_df))
    axs.annotate(
        f"Cophenetic\nCorrelation\n{coph_corr:0.2f}",
        (0.80, 0.80),
        xycoords="axes fraction",
    )
dendrogram(Z, color_threshold=18)
plt.axhline(y=18, c="red", lw=1, linestyle="dashdot")
%%time
# Agglomerative Clustering with 3 clusters, ward linkage and euclidean as the metric
HCmodel = AgglomerativeClustering(n_clusters=3, affinity="euclidean", linkage="ward")
HCmodel.fit(subset_scaled_df)
# Calculting the silhouette score for the Agglomerative clustering
print(silhouette_score(subset_scaled_df, HCmodel.labels_, metric="euclidean"))
0.5413025796835845
CPU times: total: 78.1 ms
Wall time: 52.6 ms
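The silhouette score can also be used to compare candidate cluster counts directly, not just to evaluate a single model. A minimal sketch on synthetic blobs (hypothetical data, not the bank data); with three well-separated blobs, k = 3 should score highest:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

# Three well-separated synthetic blobs
rng = np.random.default_rng(2)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.3, size=(30, 2)),
    rng.normal(loc=5.0, scale=0.3, size=(30, 2)),
    rng.normal(loc=10.0, scale=0.3, size=(30, 2)),
])

# Score several candidate cluster counts and keep the best one
scores = {}
for k in range(2, 6):
    labels = AgglomerativeClustering(n_clusters=k, linkage="ward").fit_predict(X)
    scores[k] = silhouette_score(X, labels, metric="euclidean")

best_k = max(scores, key=scores.get)
```

Silhouette values lie in [-1, 1]; values near 1 indicate tight, well-separated clusters, which is why the true blob count wins here.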
subset_scaled_df["HC_Clusters"] = HCmodel.labels_
df["HC_Clusters"] = HCmodel.labels_
df['HC_Clusters'].value_counts()
0    387
1    223
2     50
Name: HC_Clusters, dtype: int64
cluster_profile = df.groupby("HC_Clusters").mean()
cluster_profile["count_in_each_segments"] = (
df.groupby("HC_Clusters")["Customer Key"].count().values
)
# let's see the customer keys in each cluster
for cl in df["HC_Clusters"].unique():
    print(
        "The",
        df[df["HC_Clusters"] == cl]["Customer Key"].nunique(),
        "customers in cluster",
        cl,
        "are:",
    )
    print(df[df["HC_Clusters"] == cl]["Customer Key"].unique())
    print("-" * 100, "\n")
The 385 countries in cluster 0 are: [87073 17341 40496 54838 35254 46635 97825 83125 35483 15129 83290 56486 31903 45909 14263 46813 81878 35549 85799 39122 81531 69965 18595 44398 32352 40898 27101 33457 45088 23302 27408 65372 21531 56843 17165 89328 20072 71402 47496 24808 17036 67193 34423 97109 55382 51811 53936 66504 53207 18514 51319 36340 36934 95925 49771 22919 21233 74544 52025 45652 73952 49418 77026 49331 75775 54906 94666 11698 34677 95610 41380 38033 85337 38994 67911 92956 77641 57565 53814 30712 19785 31384 16374 50878 78002 83459 91987 51552 72156 24998 45673 11596 87485 28414 81863 33240 11466 23881 44645 49844 92782 22824 26767 26678 50412 17933 34495 47437 22610 41159 64672 62483 85614 96548 19137 69028 70779 38244 67046 64897 46223 36628 17565 77381 11799 81940 66706 87838 94437 33790 44402 29886 66804 47866 61996 15318 89635 71681 71862 96186 22348 36243 88807 82376 98126 80347 17649 62807 92522 57459 44579 45476 61994 11398 24702 27824 45878 72431 19215 23409 16418 85122 55060 55478 65574 31113 96929 78912 68439 62864 31515 77954 88207 78618 31551 75792 29864 45440 97954 90189 55090 17703 33991 88884 45808 50706 92140 88123 53932 65908 25321 87456 48602 97530 48657 76209 49913 53002 61122 82807 93496 64519 31950 23110 96297 28408 37252 41287 52460 26604 58019 87219 36839 12663 48667 42887 14439 60851 41266 37438 65747 81166 20570 14816 11265 24980 37934 70707 84351 89446 17325 64774 53166 45341 94595 55170 92489 92933 36504 40508 15798 70101 77613 84360 48402 46776 67258 44804 29919 65781 12456 62649 74446 36632 76024 75065 51682 18397 29102 56367 95147 44379 76957 42921 23102 61324 49690 20043 44144 53552 62530 41741 22842 65825 77826 61216 83192 82023 73000 64550 90131 17382 27117 94529 21717 81910 76492 43000 48692 27476 15086 43034 99131 13140 99437 91242 39285 63710 90860 35585 58708 57451 69868 43679 30256 26334 47848 17377 39644 29176 55706 51771 83585 51867 68040 75417 34775 85645 83545 44157 38125 75398 90999 70376 33295 80942 26493 
97850 43841 79885 59316 83466 81510 35268 11734 88411 96269 87683 26063 42479 58116 67282 84888 75366 14377 59074 96534 31870 24748 68920 67637 60839 59170 90586 56270 87670 47703 35421 58511 76398 93310 36836 46373 94700 67860 99473 68862 93381 46548 74083 48660 13720 72339 99284 47198 67415 44403 58276 85234 31948 90191 49341 11562 16253 80623] ---------------------------------------------------------------------------------------------------- The 223 countries in cluster 1 are: [38414 58634 37376 82490 44770 52741 52326 92503 25084 68517 55196 62617 96463 39137 14309 29794 87241 30507 61061 24001 68067 65034 14854 81130 29112 13999 32550 82164 61517 28254 30888 46388 74126 52142 37659 83132 20119 52363 50769 68502 99026 83326 62040 37252 74625 51182 60301 96386 43886 78503 68419 32828 17937 71632 81566 29759 36929 70248 91673 61355 60403 85868 76205 66524 69214 21976 35149 27120 18821 33187 93482 90168 71881 59656 12026 99589 38970 57990 39447 79694 79403 47296 37559 38165 49198 18007 59619 37016 91099 74704 25742 11937 52736 88338 18916 92501 96213 26599 73007 97935 26089 14946 74795 73435 41634 84069 83244 87291 18086 33369 15310 98499 35256 89007 93997 16577 25440 81116 63663 69811 36111 39454 70199 11602 49697 28701 61627 34103 14248 31256 45583 52750 95507 23743 53410 53898 66200 58389 61347 59151 37802 60475 95489 77758 23768 87471 85707 97951 54785 97011 35103 18564 61009 24054 63751 52758 78473 80457 59783 64241 32374 97536 33110 36978 54281 98602 97687 28842 38410 38261 20524 37671 25330 41787 11412 55892 95495 41946 86410 76718 98969 77143 38205 53851 52783 63405 48510 97463 18145 14398 98288 69704 29058 15546 16715 87350 13215 20593 56624 33317 99596 72430 16676 40486 90958 67212 44226 94251 61776 55275 18609 54477 12122 28208 68003 79632 73811 72892 51773 96163 61234 55849 56156] ---------------------------------------------------------------------------------------------------- The 50 countries in cluster 2 are: [47437 48370 94391 50598 40019 77910 
89832 98216 54495 47650 32107 84192 53916 32584 97285 20337 15585 20620 75009 76203 33837 14916 97935 16180 49493 70974 40217 88442 17538 90839 99843 27212 91575 60190 18519 48762 58392 79953 13315 30570 78996 78404 28525 51826 65750 51108 60732 53834 80655 80150] ----------------------------------------------------------------------------------------------------
The average and ward linkage methods produce almost the same clusters, so either can be used.
# lets display cluster profile
cluster_profile.style.highlight_max(color="lightgreen", axis=0)
| HC_Clusters | Customer Key | Avg_Credit_Limit | Total_Credit_Cards | Total_visits_bank | Total_visits_online | Total_calls_made | K_means_segments | count_in_each_segments |
|---|---|---|---|---|---|---|---|---|
| 0 | 54925.966408 | 33713.178295 | 5.511628 | 3.485788 | 0.984496 | 2.005168 | 0.002584 | 387 |
| 1 | 55163.973094 | 12197.309417 | 2.403587 | 0.928251 | 3.560538 | 6.883408 | 1.000000 | 223 |
| 2 | 56708.760000 | 141040.000000 | 8.740000 | 0.600000 | 10.900000 | 1.080000 | 2.000000 | 50 |
sns.pairplot(subset_scaled_df, diag_kind='kde', hue='HC_Clusters');
fig, axes = plt.subplots(1, 5, figsize=(16, 6))
fig.suptitle("Boxplot of original numerical variables for each cluster")
counter = 0
for ii in range(5):
    sns.boxplot(
        ax=axes[ii],
        y=df[num_cols[counter]],
        x=df["HC_Clusters"],
    )
    counter = counter + 1
fig.tight_layout(pad=2.0)
Cluster 0:
Customers in this cluster prefer to handle bank transactions in person rather than by phone or online.
Cluster 1:
Customers in this cluster prefer to call the bank to resolve issues rather than visiting a branch or banking online.
Total_calls_made ranges between 1 and 10, and 75% of the customers in this cluster/segment call the bank or its customer service department at least 5 times.
Cluster 2:
Customers in this cluster use online/digital banking to resolve their issues rather than calling or visiting the bank.
PCA_Cluster_df = PCA_df.copy()
PCA_Cluster_df['PCA_HC_Clusters'] = HCmodel.labels_
#Renaming the columns for easy identification
PCA_Cluster_df = PCA_Cluster_df.rename(columns={0:'component_1', 1:'component_2', 2:'component_3'})
PCA_Cluster_df
|   | component_1 | component_2 | component_3 | PCA_HC_Clusters |
|---|---|---|---|---|
| 0 | 0.5013 | 0.5248 | 1.8959 | 0 |
| 1 | -1.4596 | 3.1056 | -0.9068 | 1 |
| 2 | 0.5258 | 0.8240 | 0.0890 | 0 |
| 3 | -0.3623 | 0.1281 | 0.5582 | 0 |
| 4 | 1.2662 | 3.6687 | -0.1000 | 2 |
| ... | ... | ... | ... | ... |
| 655 | 3.0385 | 2.6662 | -0.0923 | 2 |
| 656 | 2.5672 | 3.3417 | -0.8738 | 2 |
| 657 | 2.8514 | 2.9831 | 0.0449 | 2 |
| 658 | 4.1783 | 4.5829 | -0.6053 | 2 |
| 659 | 3.1604 | 4.3510 | -0.2544 | 2 |
660 rows × 4 columns
sns.pairplot(PCA_Cluster_df, diag_kind='kde', hue='PCA_HC_Clusters');
import plotly as py
import plotly.graph_objs as go
trace1 = go.Scatter3d(
x= PCA_Cluster_df['component_1'],
y= PCA_Cluster_df['component_2'],
z= PCA_Cluster_df['component_3'],
mode='markers',
marker=dict(
color = PCA_Cluster_df['PCA_HC_Clusters'],
size= 20,
line=dict(
color= PCA_Cluster_df['PCA_HC_Clusters'],
width= 12
),
opacity=0.8
),
)
d = [trace1]
layout = go.Layout(
title= 'PCA HC Clusters',
scene = dict(
xaxis = dict(title = 'Component 1'),
yaxis = dict(title = 'Component 2'),
zaxis = dict(title = 'Component 3')
)
)
fig = go.Figure(data=d, layout=layout)
py.offline.iplot(fig)
sns.scatterplot(
data=PCA_Cluster_df,
x="component_1",
y="component_2",
hue="PCA_HC_Clusters",
palette="rainbow",
)
plt.legend(bbox_to_anchor=(1, 1))
PCA_Cluster_df.boxplot(by='PCA_HC_Clusters', layout=(3,2), figsize=(10,14));
subset_scaled_df.boxplot(by='HC_Clusters', layout=(3,2), figsize=(10,14));
df[df["Customer Key"].isin([47437, 37252, 97935, 96929, 50706])].sort_values('Customer Key')
|   | Customer Key | Avg_Credit_Limit | Total_Credit_Cards | Total_visits_bank | Total_visits_online | Total_calls_made | K_means_segments | HC_Clusters |
|---|---|---|---|---|---|---|---|---|
| 48 | 37252 | 6000 | 4 | 0 | 2 | 8 | 1 | 1 |
| 432 | 37252 | 59000 | 6 | 2 | 1 | 2 | 0 | 0 |
| 4 | 47437 | 100000 | 6 | 0 | 12 | 3 | 2 | 2 |
| 332 | 47437 | 17000 | 7 | 3 | 1 | 0 | 0 | 0 |
| 411 | 50706 | 44000 | 4 | 5 | 0 | 2 | 0 | 0 |
| 541 | 50706 | 60000 | 7 | 5 | 2 | 2 | 0 | 0 |
| 391 | 96929 | 13000 | 4 | 5 | 0 | 0 | 0 | 0 |
| 398 | 96929 | 67000 | 6 | 2 | 2 | 2 | 0 | 0 |
| 104 | 97935 | 17000 | 2 | 1 | 2 | 10 | 1 | 1 |
| 632 | 97935 | 187000 | 7 | 1 | 7 | 0 | 2 | 2 |
Observations
Several Customer Key values (e.g., 47437, 37252, 97935) appear twice with different attribute values, and the two records can even fall into different clusters, so Customer Key is not a reliable unique identifier for these customers.
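Duplicate keys like these can be flagged directly with pandas. A minimal sketch on a small hypothetical frame mimicking `df` (the values are illustrative, not the real data):

```python
import pandas as pd

# Hypothetical frame with one repeated Customer Key
df = pd.DataFrame({
    "Customer Key": [47437, 47437, 37252, 50706],
    "Avg_Credit_Limit": [100000, 17000, 6000, 44000],
})

# keep=False flags every occurrence of a duplicated key, not just the repeats
dup_mask = df["Customer Key"].duplicated(keep=False)
dup_keys = df.loc[dup_mask, "Customer Key"].unique()
```

This surfaces all rows sharing a key, which is how the table above was assembled with `isin`.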
fig, (ax1,ax2) = plt.subplots(1,2, figsize=(15,10))
ax1 = plt.subplot(1,2,1)
plt.title('KMeans Predicted Classes')
sns.scatterplot(x='Avg_Credit_Limit', y='Total_Credit_Cards', style='K_means_segments', hue='K_means_segments',data=K_Means_df,ax=ax1)
ax2 = plt.subplot(1,2,2)
plt.title('Hierarchical Predicted Classes')
sns.scatterplot(x='Avg_Credit_Limit', y='Total_Credit_Cards', style = 'HC_Clusters',hue='HC_Clusters', data=HC_Model_df, ax=ax2)
plt.show()
fig, (ax1,ax2) = plt.subplots(1,2, figsize=(15,10))
ax1 = plt.subplot(1,2,1)
plt.title('KMeans Predicted Classes')
sns.scatterplot(x='Avg_Credit_Limit', y='Total_visits_online', style='K_means_segments', hue='K_means_segments',data=K_Means_df,ax=ax1)
ax2 = plt.subplot(1,2,2)
plt.title('Hierarchical Predicted Classes')
sns.scatterplot(x='Avg_Credit_Limit', y='Total_visits_online', style = 'HC_Clusters',hue='HC_Clusters', data=HC_Model_df, ax=ax2)
plt.show()
fig, (ax1,ax2) = plt.subplots(1,2, figsize=(15,10))
ax1 = plt.subplot(1,2,1)
plt.title('KMeans Predicted Classes')
sns.scatterplot(x='Total_visits_bank', y='Total_calls_made', style='K_means_segments', hue='K_means_segments',data=K_Means_df,ax=ax1)
ax2 = plt.subplot(1,2,2)
plt.title('Hierarchical Predicted Classes')
sns.scatterplot(x='Total_visits_bank', y='Total_calls_made', style = 'HC_Clusters',hue='HC_Clusters', data=HC_Model_df, ax=ax2)
plt.show()
K-means clustering requires the number of clusters K in advance, i.e., how many groups to divide the data into. With hierarchical clustering, one can stop at whatever number of clusters looks appropriate by interpreting the dendrogram.
K-means is a centroid-based partitional method, while hierarchical clustering builds a tree of clusters and can be either divisive (top-down) or agglomerative (bottom-up).
For K-means, the elbow method is typically used to choose the number of clusters; for hierarchical clustering, the dendrogram is used.
For this dataset, however, the results of the two clustering methods are almost identical.
On a much larger dataset it might be possible to see a difference, and the shapes of the clusters may differ somewhat.
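The agreement between the two clusterings can also be quantified with the adjusted Rand index, which equals 1.0 for identical partitions regardless of how the labels are numbered. A sketch on hypothetical, well-separated data (not the bank data), where both algorithms should agree:

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score

# Three well-separated synthetic blobs
rng = np.random.default_rng(3)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.3, size=(40, 2)),
    rng.normal(loc=5.0, scale=0.3, size=(40, 2)),
    rng.normal(loc=10.0, scale=0.3, size=(40, 2)),
])

km_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
hc_labels = AgglomerativeClustering(n_clusters=3, linkage="average").fit_predict(X)

# 1.0 means identical partitions (up to label renaming); ~0 means chance-level agreement
ari = adjusted_rand_score(km_labels, hc_labels)
```

Running the same comparison on the bank data's `K_means_segments` and `HC_Clusters` columns would put a number on the "almost identical" claim above.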
The two algorithms were compared on execution time, distinctness of the clusters, silhouette scores, and the number of observations in each cluster.
Both algorithms produce almost the same clusters; the value counts above show very similar segment sizes.
AllLife Bank can use either clustering technique, K-means or hierarchical, to segment its customer base, since both are equally good with respect to efficiency (execution time), cluster distinctness, and silhouette score.
Since there are three segments of customers, each with a preferred channel for communicating with the bank, it is recommended that products be marketed to each segment through its preferred channel.
Additional services can also be offered based on how customers connect with the bank and on their spending patterns, which can be deduced from the average credit limit.
At the same time, the bank should try to increase the credit limit and the number of credit cards for customers in these clusters, perhaps by offering incentives, bonuses, or waiving some fees.
This approach should be most intensive for customers in cluster 1, since, per the cluster profile above, they have the lowest credit limits and the fewest cards.
Customers with high credit limits tend to use the online channel, so they can be targeted with online campaigns and coupons, and products and services can be offered to them accordingly.
Customers with comparatively low credit limits visit the bank more often; they can either be shown the benefits of online banking or be catered to with in-branch offers, services, and flyers.
Customers with low credit limits also use the online platform less frequently; they can be marketed the benefits of online banking, and call center representatives can be briefed on promotions and offers so they can reach this segment.
Depending on how the bank wants to promote its products and services, the appropriate segment can be targeted, since we know each segment's preferred mode of communication with the bank.